EDA is an iterative cycle where you:
Generate questions about your data.
Search for answers by visualizing, transforming, and modeling your data.
Use what you learn to refine your questions and/or generate new questions.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Your goal during EDA is to develop an understanding of your data. Two types of questions will always be useful for making discoveries within your data:
Which values are the most common? Why?
Which values are rare? Why? Does that match your expectations?
Can you see any unusual patterns? What might explain them?
The histogram below suggests several interesting questions:
Why are there more diamonds at whole carats and common fractions of carats?
Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
Why are there no diamonds bigger than 3 carats?
diamonds %>%
filter(carat < 3) %>%
ggplot(mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.01)
Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:
How are the observations within each cluster similar to each other?
How are the observations in separate clusters different from each other?
How can you explain or describe the clusters?
Why might the appearance of clusters be misleading?
Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
ggplot(data = diamonds, aes(x = x)) +
geom_histogram(binwidth = 0.05)
ggplot(data = diamonds, aes(x = y)) +
geom_histogram(binwidth = 0.05)
ggplot(data = diamonds, aes(x = z)) +
geom_histogram(binwidth = 0.05)
x, y, and z all include values of 0, which are impossible for a physical dimension and most likely indicate data entry errors. y and z also have a few extreme outliers on the right-hand side.
summary(select(diamonds, x, y, z))
## x y z
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.700 Median : 5.710 Median : 3.530
## Mean : 5.731 Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :10.740 Max. :58.900 Max. :31.800
The values of 0 are likely errors, and the maximum values of y (58.9) and z (31.8) are suspiciously large compared with the third quartiles.
filter(diamonds, x == 0 | y == 0 | z == 0)
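As a rough sketch of a follow-up step, the zeros can be counted per column and then recoded as missing. The names `zero_counts` and `diamonds_clean` are made up for this illustration, and recoding to NA is one possible treatment, not the only one.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Count how many rows have a zero in each dimension column.
zero_counts <- diamonds %>%
  summarise(across(c(x, y, z), ~ sum(.x == 0)))
zero_counts

# Recode the impossible zeros as NA so they no longer distort summaries.
diamonds_clean <- diamonds %>%
  mutate(across(c(x, y, z), ~ na_if(.x, 0)))

summary(select(diamonds_clean, x, y, z))
```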
We can check for outliers by plotting the relationship between two variables.
ggplot(data = diamonds, aes(x = x, y = y)) +
geom_point()
ggplot(data = diamonds, aes(x = x, y = z)) +
geom_point()
ggplot(data = diamonds, aes(x = y, y = z)) +
geom_point()
To figure out what x, y, and z measure, we could look at some summary statistics.
summarise(diamonds, x_bar = mean(x), y_bar = mean(y), z_bar = mean(z))
On average, z is smaller than x and y, so I would say that z is the depth and that x and y are the length and width.
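One way to check this guess is against the dataset documentation (`?diamonds`), which defines the depth column as the total depth percentage, 2 * z / (x + y). The sketch below recomputes that percentage and compares it with the recorded depth; `depth_check`, `depth_from_xyz`, and `abs_diff` are names invented for this example.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

depth_check <- diamonds %>%
  filter(x > 0, y > 0, z > 0) %>%  # drop rows with impossible zero dimensions
  mutate(depth_from_xyz = 100 * 2 * z / (x + y),
         abs_diff = abs(depth_from_xyz - depth))

# A small typical difference supports reading z as the depth.
summarise(depth_check, median_abs_diff = median(abs_diff))
```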
Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
summary(diamonds$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2401 3933 5324 18823
ggplot(data = diamonds, aes(x = price)) +
geom_histogram(binwidth = 5)
# this bin is too narrow
ggplot(data = diamonds, aes(x = price)) +
geom_histogram(binwidth = 5) +
coord_cartesian(xlim = c(0, 2000))
ggplot(data = diamonds, aes(x = price)) +
geom_histogram(binwidth = 50) +
coord_cartesian(xlim = c(0, 2000))
How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
diamonds %>% count(carat == 0.99)
diamonds %>% count(carat == 1)
# or
diamonds %>%
filter(carat == 0.99 | carat == 1) %>%
count(carat)
diamonds %>%
filter(carat == 0.99 | carat == 1) %>%
ggplot(aes(x = carat)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# or
ggplot(data = diamonds, aes(x = carat)) +
geom_histogram(binwidth = 0.01) +
coord_cartesian(xlim = c(0.99, 1))
Far more diamonds are recorded at 1 carat than at 0.99 carat. A plausible explanation is that people are more likely to buy 1 carat diamonds, so retailers round up the size.
Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
ggplot(data = diamonds, aes(x = price)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = diamonds, aes(x = price)) +
geom_histogram() +
coord_cartesian(xlim = c(0, 5000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = diamonds, aes(x = price)) +
geom_histogram() +
xlim(0, 5000)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 14714 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
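xlim() and ylim() set the scale limits, so values outside them are dropped before the stats are calculated (hence the warnings above) and the bins are computed from the remaining data only. coord_cartesian() applies after the stats have been calculated, so the bins and counts are unchanged and a bar at the edge of the view can be cut off.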
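The difference can be checked by looking at the binned data behind each plot. This is a minimal sketch using `layer_data()` from ggplot2, with a fixed binwidth of 500 chosen only for the illustration; `p`, `zoomed`, and `clipped` are names made up here.

```r
library(ggplot2)

p <- ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

# coord_cartesian() only zooms the view: every row is still binned.
zoomed <- layer_data(p + coord_cartesian(xlim = c(0, 5000)))

# xlim() sets scale limits: rows outside them are dropped before binning.
# The resulting warnings are suppressed to keep the output short.
clipped <- suppressWarnings(layer_data(p + xlim(0, 5000)))

sum(zoomed$count)   # equals nrow(diamonds)
sum(clipped$count)  # smaller, because out-of-range prices were removed
```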
nycflights13::flights %>%
mutate(cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60) %>%
ggplot(aes(x = sched_dep_time, color = cancelled)) +
geom_freqpoly(binwidth = 0.25)
What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
Missing values are removed from histograms, and ggplot2 warns that it has done so.
diamonds2 <-
diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y),
cut = ifelse(cut == "Fair", NA, as.character(cut)))
ggplot(data = diamonds2, mapping = aes(x = y)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 9 rows containing non-finite values (stat_bin).
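In a bar chart, NA values are treated as a separate category and drawn as their own bar. The difference arises because a discrete axis can give NA its own position, while a continuous axis has no place to put it.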
ggplot(data = diamonds2, aes(x = cut)) +
geom_bar()
What does na.rm = TRUE do in mean() and sum()?
na.rm = TRUE removes missing values before performing the operation.
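A small illustration with a made-up vector:

```r
x <- c(1, NA, 3)

mean(x)                # NA: one missing value makes the result missing
mean(x, na.rm = TRUE)  # 2: computed from 1 and 3 only
sum(x)                 # NA
sum(x, na.rm = TRUE)   # 4
```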
Instead of displaying the count on the y-axis of a geom_freqpoly() plot, we can display the density, which is the count standardized so that the area under each frequency polygon is one.
ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
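To make the standardization concrete, the sketch below bins price for the whole dataset with base R's `hist()` and confirms that density is count / (total count * bin width), so the areas sum to one. The breaks are an assumption chosen to match the binwidth of 500. In the plot above the same calculation is done separately for each cut.

```r
library(ggplot2)  # only needed for the diamonds dataset

breaks <- seq(0, 19000, by = 500)
h <- hist(diamonds$price, breaks = breaks, plot = FALSE)

# density = count / (total count * bin width)
manual_density <- h$counts / (sum(h$counts) * diff(breaks))
all.equal(manual_density, h$density)

# The area under the histogram (sum of density * bin width) is 1.
sum(h$density * diff(breaks))
```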
Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.
nycflights13::flights %>%
mutate(cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60) %>%
ggplot(aes(x = sched_dep_time, y = after_stat(density), color = cancelled)) +
geom_freqpoly(binwidth = 0.25)
What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
head(diamonds)
cor(diamonds$price, select(diamonds, carat, x, y, z))
## carat x y z
## [1,] 0.9215913 0.8844352 0.8654209 0.8612494
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point()
ggplot(data = diamonds, aes(x = carat, y = cut)) +
geom_boxplot()
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
geom_point()
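Carat has the strongest correlation with price (about 0.92). There is only a weak relationship between carat and cut, but lower-quality cuts tend to be somewhat larger, so a large diamond with a lower-quality cut can cost more than a smaller, better-cut one. People would rather buy a larger diamond at the expense of cut quality.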
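A grouped summary gives a quick numerical check of this reasoning. This is a sketch only; `cut_summary` and its column names are invented for the example.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Typical size and price for each cut quality.
cut_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(median_carat = median(carat),
            median_price = median(price),
            n = n())
cut_summary
```

If the lower-quality cuts show larger median carat values, that is consistent with size, rather than cut, driving the higher prices.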
Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
# map cut to x so that coord_flip() turns the vertical boxplots horizontal
ggplot(data = diamonds, aes(x = cut, y = carat)) +
geom_boxplot() +
coord_flip()
#install.packages("ggstance")
library(ggstance)
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
ggplot(data = diamonds, aes(x = carat, y = cut)) +
geom_boxploth()
One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
#install.packages("lvplot")
library(lvplot)
ggplot(data = diamonds, aes(x = cut, y = price)) +
geom_lv()
Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
ggplot(data = diamonds, aes(x = cut, y = price)) +
geom_violin()
ggplot(data = diamonds, aes(x = price)) +
geom_histogram() +
facet_wrap(~ cut, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = diamonds, aes(x = price, color = cut)) +
geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?
ggplot(data = diamonds, aes(x = color)) +
geom_bar() +
facet_wrap(~ cut, ncol = 1)
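Another option, sketched here, is to convert the counts to proportions within each cut and map those to the fill of a tile plot. `color_within_cut` and `prop` are names made up for this example. Grouping by color instead would show the distribution of cut within color.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

color_within_cut <- diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%  # share of each color within a cut
  ungroup()

ggplot(data = color_within_cut, aes(x = color, y = cut, fill = prop)) +
  geom_tile()
```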
Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
nycflights13::flights %>%
group_by(dest, month, year) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(x = factor(month), y = dest, fill = avg_delay)) +
geom_tile()
## `summarise()` has grouped output by 'dest', 'month'. You can override using the `.groups` argument.
# removing destinations without 12 months of flights
nycflights13::flights %>%
group_by(dest, month, year) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
group_by(dest) %>%
filter(n() == 12) %>%
ungroup() %>%
ggplot(aes(x = factor(month), y = reorder(dest, avg_delay), fill = avg_delay)) +
geom_tile()
## `summarise()` has grouped output by 'dest', 'month'. You can override using the `.groups` argument.
Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = cut, y = color)) +
geom_tile(mapping = aes(fill = n))
Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualization of the 2d distribution of carat and price?
ggplot(data = diamonds, aes(x = price, color = cut_number(carat, 5))) +
geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = diamonds,
aes(x = price, color = cut_width(carat, width = 1, boundary = 0))) +
geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
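The main thing to consider is how many observations land in each bin. A quick way to see this, as a sketch, is to tabulate the bins produced by each function; `number_counts` and `width_counts` are names invented for this example.

```r
library(ggplot2)  # provides diamonds, cut_number() and cut_width()

# cut_number() aims for roughly the same number of diamonds per bin.
number_counts <- table(cut_number(diamonds$carat, 5))
number_counts

# cut_width() uses bins of equal width, so the counts can be very uneven.
width_counts <- table(cut_width(diamonds$carat, width = 1, boundary = 0))
width_counts
```

With uneven counts, the lines for sparsely populated bins sit close to zero on a count scale and are hard to compare, which is one reason to consider plotting density instead.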
Visualize the distribution of carat, partitioned by price.
ggplot(data = diamonds,
aes(x = carat, color = cut_width(price, 2500, boundary = 0))) +
geom_density()
ggplot(data = diamonds,
aes(x = carat, y = cut_width(price, 2000, boundary = 0))) +
geom_boxplot()
ggplot(data = diamonds, aes(x = carat, y = cut_number(price, 10))) +
geom_boxplot()
How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?
There is much less variation in the price of small diamonds compared to large diamonds.
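The spread can also be summarized numerically. The sketch below uses the interquartile range of price within 1-carat-wide bins; the bin width and the names `price_spread`, `carat_bin`, and `iqr_price` are choices made for this illustration. Bins holding only a handful of diamonds should be read with caution.

```r
library(ggplot2)  # provides diamonds and cut_width()
library(dplyr)

price_spread <- diamonds %>%
  group_by(carat_bin = cut_width(carat, width = 1, boundary = 0)) %>%
  summarise(n = n(),
            iqr_price = IQR(price))
price_spread
```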
Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.
ggplot(data = diamonds, aes(x = carat, y = price, fill = cut)) +
geom_hex()
ggplot(data = diamonds, aes(x = cut_width(carat, width = 1, boundary = 0), y = price, color = cut)) +
geom_boxplot()
Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.
ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
Why is a scatterplot a better display than a binned plot for this case?
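A binned plot could hide the outliers: each bin collapses many combinations of x and y into a single cell, so an unusual combination can no longer be told apart from the ordinary points around it.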
ggplot(data = diamonds) +
geom_point(mapping = aes(x = cut_width(x, width = 1), y = cut_width(y, width = 1)))
# coord_cartesian() is dropped here: on a binned (discrete) axis its limits would refer to level positions, not to the original measurements